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Inferring the sequence of states from observations is one of the most fundamental problems 
in Hidden Markov Models. In statistical physics language, this problem is equivalent to 
computing the marginals of a one-dimensional model with a random external field. While this 
task can be accomplished through transfer matrix methods, it becomes quickly intractable 
when the underlying state space is large. 

This paper develops several low-complexity approximate algorithms to address this infer- 
ence problem when the state space becomes large. The new algorithms are based on various 
mean-field approximations of the transfer matrix. Their performances are studied in detail 
on a simple realistic model for DNA pyrosequencing. 



I. INTRODUCTION 



Hidden Markov Models (HMM's) are a workhorse of modern statistics and machine learning, 
with applications ranging from speech recognition to biological sequence alignment, to pattern 
classification [1, 2, 3]. An HMM defines the joint distribution over a sequence of states S_ = 
{Si, S2, ■ ■ ■ , Si, . . . , St), Si £ S, and observations Y_ = {Yi, Y2, . . . ,Yi, . . . ,Yt), whereby the states 
form a Markov chain and the observations are conditionally independent given the sequence of 
states. In formulae we have 

t t 

V[S,Y] = Pi{Si)llPi{Si\S^^i)YlQi{Yi\Si) . (1) 

i=2 i=l 

The most fundamental algorithmic task related to HMM's is arguably the problem of inferring 
the sequence of states (5*1, 5*2, ... , St) from the observations. The conditional distribution of the 
state sequence given the observations is, by Bayes theorem, 

^ t t 

IP [S\Y] = — ^Po(So) n P^iS^\S^^l) U Q^{Y,\S,) , (2) 
^^^^ ^=2 i=l 

where Z{S_) = P [Y] can be thought as a normalization constant. The state sequence can then be 
estimated by the sequence of most likely states (symbol maximum a posteriori probability -MAP- 
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estimation) 



5'j(y) = argmax < 



(3) 



This reduces the inference problem to the problem of computing marginals of P [<S|y]. 

From a statistical physics point of view [4], the conditional distribution (2) can be regarded as 
the Boltzmann distribution of a one dimensional system with variables Si, S2, ■ ■ ■ , St and energy 
function 



at temperature (3 = 1. The sequence of observations thus act as a quenched external field. As 
suggested by this analogy, the marginals of P [S_\Y] can be computed efficiently using a transfer 
matrix algorithm. In the present context this is also known as the Bahl-Cocke-Jelinek-Raviv 
(BCJR) algorithm [5]. 

The BCJR algorithm has complexity that is linear in the sequence length and quadratic in 
the number of states \S\. More precisely, the complexity in \S\ is the same as multiplying an 
|5| X \S\ matrix times an |5| vector. While this is easy for simple models with a few states, it 
becomes intractable for complex models. A simple mechanism leading to state space explosion is 
the presence of memory in the underlying Markov chain, or the dependence of each observation 
on multiple states. In all of these cases, the model can be reduced to a standard HMM via state 
space augmentation, but the augmented state space becomes exponential in the memory length. 
This leads to severe limitations on the tractable memory length. 

This paper proposes several new algorithms for addressing this problem. Our basic intuition is 
that, when the memory length gets large, the transfer matrix can be accurately approximated using 
mean field ideas. We study the proposed method on a concrete model used in DNA pyrosequencing. 
In this case, one is interested in inferring the underlying DNA sequence from an absorption signal 
that carries traces of the base type at several positions. The effective memory length scales roughly 
as the square root of the sequence length, thus making plain transfer matrix impractical. 

The paper is organized as follows. The next section will define the concrete model we study, 
and Section III describes the connection with DNA pyrosequencing to motivate it. Section IV 
describes the transfer matrix algorithm and several low complexity approximation schemes. After 
describing a few bounds in V, numerical and analytical results are collected in Section VI. 



t t 



E{S) = - log Pi{Si) -Y^log Pi{Si\S,_i) -J2^og Qi{Yi\Si) 



(4) 



i=2 i=l 
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II. MODEL AND DEFINITIONS 
A. Definition of the model 

Consider a sequence of t positive integers A = {Ai, . . . , At}. Each entry Aa of this sequence is 
generated randomly and independently from the others with probability distribution P{Aa)- This 
distribution has finite support, i.e. we introduce a positive integer c such that P{x) = if x > c. 

This sequence is observed through a non-recursive linear filter, i.e. each observation does not 
depend on any previous observation. That is, we observe the sequence Y_ = {Yi, . . . ,Yt} G M* 
defined by 

o 

Ya = ^a{i,a)Ai + r]a , (5) 
1=1 

where a{i,a) is what we call the memory function and tja is a Gaussian random variable with mean 
and variance a"^ which is drawn independently for each position a. The memory function also 
has a finite support. We introduce an integer n which represents the total memory length, i.e. we 
assume a{i, o) = when a — i > n. Therefore the sum on the right hand side of Eq. (5) effectively 
starts at max(l, a — n). There is no restriction on the sign of a(z, a). 

The relationship between the sequences A and Y_ can be described by a factor graph represen- 
tation. This is a bipartite graph including one function node for every Ya in Y_ and one variable 
node for every Aa in A. Except for the first n, every function node is connected to exactly n + 1 
variable nodes by as many edges. A schematical representation is presented in Fig. 1. 



a 




71+1 

FIG. 1: Portion of a factor graph representation of the relationship between the original sequence A 
(circles) and the sequence Y_ (squares) and where n = 3. 

Our goal is to develop an efficient algorithm to recover the sequence A = {Ai, . . . , At} from 
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the observed noisy sequence Y_ = {Ya, . . . ,Yt}. 



The sequence -A is a chain of i.i.d. variables therefore it can be regarded as zero-order Markov 
chain. The output Ya is observed from the underlying state Aa as well as the all the states preceding 
Aa- This problem can therefore be thought of as a variable length higher-order hidden Markov 
model (HMM) where the hidden states are i.i.d. and the observations depend on all previous 
hidden states ([1, 6, 7]). 

We will denote by X = {xi, . . . , xt} € [1, c]" an estimate of A. Using Bayes rule, the posterior 
probability of X knowing Y_ is 

We use the maximum a posteriori (MAP) method [8] to produce the estimation. The probability 
P [y |X] is known as the likelihood function and takes the form 

t 

^[Ym = ll^a{Xa-n,---,Xa), (7) 

a=l 

where, if the index of Xi is such that i < then we set Xj = 0. This sets the boundary condition 
for small a. For a > t, the variables Xa are not present in any equation and are considered free. 
The probability ^'a(xa-n, • • • j Xa) in Eq. (7) is the density of rja defined in Eq. (5) and is written 

2" 



^aiXa-n, ■ ■ ■ ,Xa) = F [Ya\{Xa-n, ■ ■ ■ , Xa}] = ^^-^ exp 

V 27r(T^ 



^ f - ^ a{i,a)xi j 

\ i=a—n / 



, (8) 



2a^ 

where is the variance of rja- The other terms in Eq. (6) are 

t 

a=l 

which is the prior distribution, and 

t 

^[Y]=Y.F [X] H ^a{Xa-n, • • • , X.) = , (10) 

X a=l 

which is a normalization constant. In fine, we construct the marginal distribution 

u{Xa)= Yl ^i^l^U (11) 
{xt/bj^a} 

which yields the decoded sequence as 

X* = ai[gmaic{i'{xa)} ■ (12) 
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which is the maximum hkeUhood estimate. We use here a symbol MAP decoding, with the hope 
of minimizing the error for each single Xa, instead of a block MAP decoding which would be to 
minimize the error over the sequence as a whole. 

The direct computation using this method entails a summation over 0(c*) terms which rapidly 
becomes unpractical when c and/or t grow. In section IV we introduce four separate algorithms 
with various levels of approximation to overcome this limitation. 

B. Specific forms of the prior and memory functions 

In this section we give some details about the sequences we use in our numerical simulations. 
To generate the integer sequence A we use several different probability distributions /3. The details 
of these will be described in subsection II B 1. There are also several a functions we will be using, 
these are described in subsection II B 2. 

1. The distribution (3 

Consider a sequence of i.i.d. Bernoulli variables taken in {0, 1} with success probability q. We 
then construct the integer sequence A by counting the number of repetitions of Os or Is in this 
Bernoulli sequence. For instance, if the Bernoulli sequence is 000101100011111 this will correspond 
to -A = {311235}. We can generate this sequence directly using the distribution 

(3g{l)=q{l-q)'-^ ,V/GN. (13) 

We will give more details on this particular distribution in section III. The sequence Y_ is then 
generated using Eq. (5). The distribution fig does not admit a finite support but it decays rapidly 
as / — > oo, thus we can still introduce an effective cutoff parameter c. 

An alternate approach is to truncate the distribution at large enough c. The corresponding 
truncated geometrical distribution is then defined as 

m = ltq{l-q)'-^ if/e [l,c], 

= otherwise, (14) 

where 7j~^ = j3g{l < c) = Yl'i=i lO- ~ qY'^ = 1 — (1 — qY is very close to one if c is large enough. 
This expression gives us a behavior similar to the one of Pg while enabling us to use the same 
distribution for coding and decoding. 
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The uniform distribution, called (3u, is 



Pu{l) = \ if 1</<C, 



otherwise. (15) 



2. The function a 

Here we will give the expressions we use in our different test cases for the function a defined in 
Eq. (5). 

The first expression we use for the function a is 

a.(^,a)=("-L^J)(l-p)L^Jp% (16) 

if there is a A; £ N such that i = a — 2k, otherwise arii, a) = 0. Furthermore, we have p £ [0, 1]. The 
derivation of this expression, as was the expression of Pg in the previous section, will be described 
in more detail in section III. The value of p effectively tunes the memory length, i.e. n ~ tp. 
We define two other functions that will be useful for test cases, as were Eqs. (14) and (15), 

af{i,a) = 1, (17) 

ah{i,a) = ^_^-j^2 ' ^^^^ 

for (a—i) < n and are equal to zero otherwise. These functions are useful as they are non-zero in the 
range [a — n — l,a] only and thus we have a better control over the precision of our approximation 
since we decide of the value of n for Eq. (5). 

When there is no subscript to a it can take any of these three values. 



III. THE PYROSEQUENCING EXAMPLE 

The original inspiration of this work comes from the Sequence-by-Synthesis technique, called 
pyrosequencing, originally introduced in [9] and described in [10, 11, 12, 13, 14] with a full review 
in [15] which gives a concrete and precise description of the method and its history. It is one way of 
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sequencing DNA strands by a repeated set of chemical tests. Since its introduction, it has become 
an industry standard for low cost high efficiency sequencing. This paper being, in the end, not 
directly connected to the technique we will simply give an introduction to its fundamentals. A 
more detailed review of the usage of our approach on experimental pyrosequencing data is planned 
as a future publication. 

In pyrosequencing, the initial solution contains many copies of the same strand, which is called 
the base sequence, and a set of enzymes that will catalyze and react with the by product of the 
main reaction to emit light. Tests are done by incorporating a repeated cycle of the 4 base type 
nucleotides into the solution. When a nucleotide is introduced it will react with the DNA sequences 
if the first available position of the base sequence is its complementary base (that is bases A and 
T on one side and bases G and C on the other). The reaction will happen for as long as this 
same base is repeated. Furthermore, the reaction will produce a readable response (a pulse of 
light) that is proportional to the total number of repetitions that were encountered, this is called a 
homopolymeric (HP) sub-sequence. Finally, all positions on the DNA strands that reacted will now 
be obstructed to subsequent tests and thus freeing the next available base in the base sequence for 
reaction. As an example. Fig. 2 shows a series of cycles applied to the sequence TTGAAAGCC. 

When a test is positive, the chemical reaction is incomplete, that means only a fraction of all 
the strand copies react. We call the average of this fraction the incorporation rate p £ [0)1]- 
Furthermore, it means there is a fraction 1 — p of all the copies that did not react to this test and 
that a fraction p of this fraction will react only at the next cycle, and so on for each test. Finally, 
it results that the responses will depend on this incomplete incorporation and which dependency 
can be simulated by the use of a memory function. In full rigour, there is an additional parameter 
called the non-specific incorporation rate [10]. It measures the average fraction of strands that 
react when the test is negative. We do not take this element into account in our model since its 
value is usually very small. 

An original mathematical description was introduced in [16] as a biochemical model. Using 
kinetic considerations, it investigates the differential equations describing the single pulse due to a 
single incorporation as well as a succession of pulses linked to as many incorporations. The model 
that is developed gives an effective description of pyrosequencing without approximation. 

If we denote hy A = {Ai, . . . ,At} the HP sub-sequences and Y_ = {Yi, . . . ,Yt} the response 
sequence, then in fact this work can be considered a slight adaptation of pyrosequencing in which 
we consider base sequences to be made up of only two base types. This model being binary, we 
naturally call them and 1. This also explains the inspiration for the definition of the distribution 
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cYc) + A) 




CAmcYc) +(G) 

(cmmmc) 



FIG. 2: Example of pyrosequencing test cycles to decode sequence TTGAAAGCC . The cycles follow the 
order A ^ T ^ C ^ G. A "— " sign means no reaction occurred, every sequence of "+" signs means a 
reaction occurred and its amplitude was multiplied by as many times. 

fig in Eq. (13) if we take q= \- In this case, we usually take the value of this cutoff parameter to 
be c = 15 since the probability for I being bigger than this value of c is P{1 > c) ~ 3.05 x 10~^. 

We keep the same definition for the incorporation rate p. The sequence of incomplete reactions 
can thus be plotted onto a directed acyclic graph (DAG) which can look like Fig. 3 (when the first 
tested base is represented by 0). 

For each test in our two base sequence, we have a fraction p which reacts and presents the 
other base in the next HP sub-sequence for the following test and a fraction \ — p which does not 
react and thus will not react with the subsequent test. We then count the vertices on the graph to 
obtain a. For instance, in Fig. 3 we have singled out the position of a(4, 6). Its value is obtained 
by adding the lengths of all the direct paths that lead from the starting point to its position. 
That is, there are 5 possible paths and they all have the same length of (1 — p)p^ and therefore 
a(4, 6) = 5(1 — p)p^ . By generalizing this to any position we obtain Qr(i, a) expressed in Eq. (16) 
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10 10 10 
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a(4,6) 

FIG. 3: Graph showing evolution of subgroups in response to tests when the first base in the sequence is a 

and this is also the first test performed. 

and by adding noise the pyrosequencing equivalent response Ya of test a is expressed in Eq. (5). 

0.4 I . < < . 1 0.7 I . . ^ ^ . ^ 1 




FIG. 4: On the left: ar{i, a) for a = 10, a = 40 and a = 100 (from left to right), and p = 0.9. On the right: 
ar{i, a) for a = 40, a = 100 and a = 140 (from left to right) and p = 0.99. In both figures: the solid lines 
are the envelopes of ar and the dotted lines show the alternating behavior of ar between zero and non-zero 

values. 

The value of the memory n introduced in section II A is linked to the graph of this function 
ar- We define n as the smallest integer so as to keep a certain percentage of the total weight of 
ar between t — n and t. The weight being here the sum of all values of ar{i,t) for i £ In 
practice, we keep at minimum 99% of the total weight. Furthermore, as seen in Fig. 4, the graph 
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of Or is such that we assume for all a and all i < i„ = a — n — 1 that we have ar{i, a) = 0, indeed, 
as a grows, the graph of ar widens and thus n is defined for the worst possible case. It is also 
shown in Fig. 4 how the peak widens as p decreases. 



IV. ALGORITHMS 

In section II A we introduced the general model which we adopt. In this section we will introduce 
four different algorithms to estimate the sequence while achieving the best compromise between 
precision and complexity. 

A. Transfer Matrix algorithm 

The first algorithm we introduce will be referred to as the transfer matrix algorithm (TM). 

This method relies on the combination of two iterative expressions of the position o, one for 
each direction forward and backward, respectively z[ and Z^, and which we will refer to as the 
constrained partition functions. Both are indexed by the sequence {xa-n+i, ■ ■ ■ ,Xa} G [l;c]" such 
that 

^a{Xa—n+lj---jXa) = l3{Xa) ^ ^ Z^_'^(Xa—ni---^Xa—l)'^aiXa—ni---^Xa), (19) 

•^a — n 

1 {Xa—n+1 ) • • • ) Xa-i^l 

) ■ (20) 

Xa + l 

These two functions can then be combined to write the exact marginal of {xa-n+i, • • • ,Xa} with 
respect to the distribution (6) as 

Za {^a—n+l i ■ • ■ : Xa)Z^ (x^— n+1 1 ■ ■ ■ i 3Ja) 
'^Xa-n+i,---,Xa ^"-{Xa-n+l^ ■ ■ ■ 1 X a) Z^{x a-n+l 1 ■ ■ ■ jXa) 

Therefore, the marginal distribution of variable Xa again with respect to the probability distribution 
defined in Eq. (6) is then 

l^iXa) = Jj= X] Z[{Xa-n+l,- ■ ■ ,Xa)Z^{Xa-n+l,- ■ ■ ,Xa) , (22) 

^'a — n + l ^•••i^a — l 

where A/" is a normalization constant. 

This algorithm corresponds to a reordering of the model defined in section II A. It has complexity 
of order Q^c^) which is huge in most of the regimes we are interested in. Because of this, we 
introduce in subsequent sections a set of approximations to reduce this complexity. 



l^\Xa—n+l 1 ■ ■ ■ iXa) — ^v/T- ' TTTlTZ I 7 ' \^^) 



11 



B. First order approximation 

In this section we introduce two algorithms that emerge from the same approximation to the 
TM algorithm of section IV A. They rely on a first order expansion of the constrained partition 
functions. 

We assume the constrained partition functions Z*{xa-n+i, • • • , Xa) (where the * is either / or 
b) defined in Eqs. (19) and (20) factorize approximately 

a 

Z2{Xa-n+l,---,Xa)^ JJ ^a,ii^i) ^ (^3) 
i=a— n+1 

where the z* ■ are functions of a single variable. The new iterative procedures are one variable 
nationalizations of the functions in Eqs. (19) and (20) and where the functions Z* on the right 
hand side are replaced by the approximation in Eq. (23). These can be written, for i £ [a — n + 1, a], 
as 

= "'^f ' ^(^{Xa) n z{^lj{Xj)^a{Xa-n,---,Xi,...,Xa), (24) 

nf j=a-n 

^XiiXi) = ""^^^ ' ^/3(Xa+l) JJ Z^+ij(%) ^'a+l(^a-n+l,---,a:j,...,Xa+l), (25) 

Rb j=a-n+2 

where the M* are normalization constants and where = [l,c]" (resp. i?^ j = [1,0]") is the set 

of all possible values of {Xa-n, ■ ■ ■ • • • ,^a} (resp. {Xa-n+l, ■ ■ ■ • • • ,ia+i})- 

Furthermore, since they were not estimated in any previous step, zl^_^ a(^«) 4+i a-n+ii^a-n+i) 
are both set to 1 prior to the computation of z[^a{xa) and a-n+iixa-n+i) which are the first 
values to be computed at step a in their respective directions. These values are then reinjected into 
the subsequent calculations at step a by setting zl_ig^{xa) = z[^a{xa) and z'^^-^ ^_^_^_i{xa-n+i) = 
Za a-n+ii^a-n+i)- To initiate the backwards iteration, we define z^^^{xa) = for a G {t — n, . . . , 
to account for the free boundary conditions. 

Finally, we have an approximation of the marginal of 

l^{Xa) = ^ zl^{Xa) Z^^iXa), (26) 

where A/" is a normalization constant. 

The sums in Eqs. (24) and (25) are over c" terms and thus, this algorithm, as such, has 
no benefit over the TM algorithm described in section IV A. We therefore introduce another 
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couple of approximations to this algorithm. When possible, though, we will wish to compare these 
approximations to this algorithm which we will refer to as TM.IA. 

1. First order approximation with Monte Carlo 

The first of these two approaches will be called first order approximation with Monte Carlo 
(TM.lA.MC) since it relies on random sampling from iteratively computed distributions. 

At step a of the procedure described in (24), we start by estimating zl^a{xa) knowing that 
we have computed the values of zl-{xi) for all 6 < a and in particular the values of zl_^-{xi) 
which are probability distributions over the variables Xj. We can therefore use the importance 
sampling technique. By using these distributions, we generate Nf independent random samples of 
n independent variables {xa-n, ■ ■ ■ ,Xa~i} which we use to compute the estimation 

^a,a{Xa) — zl^a{Xa) = -JJ^ '^a{Xa-n, ■ ■ ■ : Xa-l) , (27) 

where is a normalization constant and where we use the same set of samples for all Xa £ [1, c]. 
Other positions z[-{xi) are each computed in the same manner with a new sampling for each one. 

The backwards iteration is hereby discarded. Indeed, the backwards iteration of Eq. (25) will 
result in a large number of false positives. This happens because the i{xi) become very peeked 
about a mean that is not the correct Xi due to the repercussion of early errors in subsequent 
iterations. It is therefore necessary to perform the initial sums over a very large number of samples 
which defeats the purpose of this algorithm. Furthermore, empirical tests show that under the first 
order approximation, very little information is actually gained by using the backwards algorithm. 

When all iterations have been computed, we simply equate the z/-(xi) to the marginals, i.e. 
Va S [l,t] and Xa G [l,c] 

''M = zl^ixa) . (28) 

The complexity is Q{Nftn) which takes of the order of a second to decode a full chain for 
typical parameter values. 



13 



2. First order approximation with Gauss 

The second first order approximation will be referred to as first order approximation with Gauss 
(TM.IA.G). 

We make the assumption that the variable X^""^ = Yl'j=a-n ck(J) present in ^'^ at step a 
(Eq. 8) can be approximated with a Gaussian random variable. This is possible since the variables 
a(j, a)xj are independently drawn from their respective distributions under the approximation of 
Eq. (23) and we assume n is large. 

The complete derivation can be found in appendix A, but if we write the mean and variance of 
X^""^ as respectively fiXi and cj^. , the iterated marginal distribution of Xi for i G {a — n, . . . ,a — 1} 
at step a can be written as 



T^l '{xi) = exp 



2(<T^ + cr^J 



(29) 



where is a normalization constant and a similar expression for (xa) since Ua does not 
exist and is replaced by the prior of Xa- P{xa)- The final iteration for a = t returns the complete 
set of marginals fi(xj) for all i. We make the same assumption on the backwards iteration as in 
section IV B 1 . 



C. Two point algorithm 

Our final algorithm, which will be called second order approximation with Gauss (TM.2A.G) 
is similar to the algorithm described in section IV B 2 but introduces a different factorized 
approximation . 

We introduce a second order approximation, which is similar to the factorized expression in Eq. 
(23), i.e. 

a 

Za,i{Xi)W[^ + Wij{Xi,Xj)], (30) 
i=a—n+l {ij) 

where the functions Wij{xi,Xj) are very small. This expression introduces two point correlations 
in the expression of the iterative marginal. 

There is one drawback to the expression in Eq. (30) and which is that it is required that the 
functions Wij{xi,Xj) — > and numerically the control of such structures is very difficult. We thus 
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make the assumption that the factor graph is in fact one-dimensional. That is, each variable is 
connected to exactly two nodes, except for the extremities. By introducing this approximation we 
can take advantage of a decomposition property for the joint probability of an arbitrary number 
of variables taken on a tree graph [18]. Thus, at any step a, for any number of successive variables 
taken between positions i and j < a , as a relationship between the joint probability and the 
marginals, we have 

where the x^+i) and fj,^^\xk) are true marginals of the approximation /n*^"^. 

By setting interactions only between closest neighbors and taking inspiration from section IV B 2, 
we consider the variable Xj^j+i = Yl']=a-n (^{h o)xj step a to be Gaussian and we assume it has 
mean and variance respectively i+i and a\_ It then comes that the main expression for the 
two point marginal at step a for the couple {xi^XiJ^i} can be written 

2 I 2 ~ ^^x,,,+l - a{i, a)xi - a{i + 1, a)xi+if 



i^-,i+i(xi,Xi+i) = — -j^ exp 



(32) 

The details of this algorithm can be found in appendix B. 



V. BOUNDS 

In this section we describe how we test the behavior of our different approximated algorithms 
when it is impossible to compare them to the TM algorithm. This is done by introducing ap- 
proximations on the probability of error and on the average number of errors that occur when 
performing estimation. We start by studying the probability of error in the limit of a memoryless 
channel and then we discuss another approximation for a channel with memory. 

A. Memoryless channel 

In this subsection, we assume that the channel is without memory, i.e. that the parameter n 
introduced in section II A is zero and thus that Eq. (5) reduces to 

Ya = Aa^r]a, (33) 
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where rja is the random variable with mean and variance cr^ defined in Eq. (5). The probabihty 
of error for the single variable is then 



xo=l 



e 2' 



argmax 



e 2o- 



-{xo+rja-x)' 



Xo 



(34) 



where I is the indicator function. Errors thus happen when the expression (3{x)e 2<t2 in 



(34) is maximized by a value x ^ xq. If the distribution (3 is uniform this occurs if 

1 



|r/| > 



if Xo 7^ l,c, 
if Xo = 1 , 



r/ < -- if Xo = c. 



(35) 
(36) 
(37) 



In general, for other expressions of (3, these events provide a lower bound on the probability we 
seek. These lead to the very general expression for the probability of error as 

c-l 



1 



Perr > ^[^0 = 1] 

where, r/ being normal, we have 



+ 



1 



+ ^P[xo = x]P 



x=2 



(38) 



1" 




1' 




" , 1' 






= P 


< "2. 


2 


1^1 > 2 



2a 



(39) 



where 



Q{x) =F[X > x] 



+ 00 



(40) 



Therefore, the probability of error can finally be expressed as 



Perr > ^/3(1) + /3(c) + 2 /3(x)^ Q , 



(41) 



where /3 is one of the distributions introduced in section II B 1, i.e. depending on which distribution 
is being studied we will use one of the following expressions 



err,g 



> 



1 



T Q 



1 

2^ 



1C-1 



Pp.rr.t ^ 3 



1 



1 

2^ 



(42) 
(43) 
(44) 



when in /?„ and f3t we have q = \- 



16 



B. Channel with memory 

Let us now derive a lower bound on the probability of errors in the case of channel with non-zero 
memory. 

To do this, consider the probability density defined in (6) and write it as follows 

P [X\Y] = 1 /3(x„)^ e-^^^(^) , (45) 



^^a{i,a){Ai - Xi) + rja 



.i=l 



(46) 



where Z is a normalization constant and 

HAiX) = ^ 

a=l 

where all the parameters are the same as in (5). 

Using these notations, we can write the block MAP probability of error as 

P{A,a)=¥{3X- Ha{X) + B{X) < Ha{A) + B{A)] , (47) 

where B{X) is the prior and takes the form 

t 

Bg{X) = 2a^\n{2)Y,Xa (48) 

a=l 

if we consider the geometrical distribution f3g with (7 = ^, 

t 

Bt{X) = 2a\\n{2)Y,Xa - tH^t)) (49) 

a=l 

if we consider the truncated distribution f3t and 

Bu{X) = Bu = 2a^t ln(c) (50) 

if we consider the uniform distribution. 

For any X_ = G N* a vector of t strictly positive integers the probability 

^{Ha{X)+ B{X) < Ha{A) + B{A)} is a lower bound of the right hand side of Eq. (47). In 
order to estimate this probability we write 

2 t a 



HAiX) - Ha{A) = 



a=l 



Y^a{i,a){Ai - Xi) 



i=l 

t 



+ 'i'Yr]aYa{i,a){Ai-Xi), (51) 



a=l i=l 



Bg{X) - Bg{A) = 2a^ ln(2) ^(x, - ^„) , (52) 



a=l 



Bu{X)-Bu{A) = 0, (53) 
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and if we define 



t 



a=l 

t 



'Ya{i,a){Ai - Xi) 

.i=l 



(54) 



VA,x = 2'Yr]a'Ya{i,a){Ai - Xi) , (55) 

a=l i=l 

Ba,x = B{X)- B{A) , (56) 



we obtain that 



P{A, a) > P { ^A,x + VA,x + Ba,x < 0} . (57) 

Furthermore, r]A,x is a Gaussian random variable of mean and variance 4a'^TiA,x, thus, using the 
same function Q as in (40), we have 

This expression is a function of both X and A and thus is still impractical both analytically 
and numerically. We thus introduce the notation X = A^'^ which differs from A at position i only 
where it takes value x / Ai, i.e. A^'^ = {Ai, . . . , Ai_i,x, . . . , A^}. That is, we have 

P{A, a) > F {3i, X : Ha{A'''') + B(A^'") < Ha{A) + B{A)} (59) 
> max P {Ha{A'''') + BiA''"") < Ha{A) + B{A)] (60) 
= maxi?i,2;(ii) , (61) 

i,x 

where the expression of Ri^x will be given in the following. We use the previous expression and Eq. 
(58) to write a lower bound on the probability of error 

Perr ^ max max Ri ^ (^4) (62) 

i X ' 

where Ma is the expectation over the distribution of the vectors A. We can also write a lower 
bound of the expectation of the number of errors as 

t 

E{#errors} > maxRi^xiA)- (63) 

i=l ^ 

The expression of Ri,xiA) varies according to the prior distribution (3, we separate the results 
accordingly. For the geometrical distribution, we have 
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which is maximized hy x = Ai — I if Ai ^ 1 and x = 2 if Ai = 1 thus we define the function of a 
single integer 

^|EL»Mt2^| (66) 



2'^\/Ea=i"(^>«) 



We can therefore estimate the lower bounds when considering the geometrical distribution as 

Perr,g > max (^^Ri{Ai = 1) + ^ Rt{Ai + 1)) , (67) 

E3{#errors} > ^ (^-R^{Ai = 1) + )-Ri{Ai / 1)] . (68) 



In the case of the uniform distribution we have, again using as a function of a single 

integer, 

1 " 

EAmax/?i,^(i.) = - V i?i(Ai) , 



(69) 



where each Rii^Ai) in Eq. (69) is in fact solely a function of % and all have the same expression, 
thus we define 



R{i) = Q 



2a 



(70) 



and thus 



Perr.u ^ IXLEIX 



Q 



2^ 



E„{#errors} > ^R{i) ■ 



(71) 
(72) 



1=1 



Finally, the expressions of Ri,x{^ and Ri{Ai) for the truncated distribution are the same as 
for the geometrical distribution and 



err,t 



> 



1 

1-2- 



max 



(73) 
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and 



Efl^^errors} > 



- 1-2^^ 



1 



(74) 



i=l A^=l 



VI. RESULTS 



In this section we study the behavior of our various algorithms according to the parameters 
used. We will separate this section into three separate subsections according to the parameters we 
use to control our tests. 

In general, the behavior of the algorithms is the same if we are to consider either Pg or Pt, 
thus we will usually only present the results for one of the two except in section VI C where we 
emphasize the similarities. 

There are in total six algorithms we compare. These were all defined in section IV except for 
the one referred to as TM.lA.f which is the first order approximation of section IV B with the 
backwards iteration discarded. 



In this subsection we vary the noise parameter a. We have eight figures (5-12) that range over 
the set of distributions /? and functions a. For each couple (/3, a) we plot the probability of error 
Perr (Figs. 5, 7, 9 and 11) and the average number of errors ^errors (Figs. 6, 8, 10 and 12) over a 
set of independent samples. We recall that the probability of error is the probability that at least 
one error occurs during the estimation process. 

We emphasize more on the use a j and ah since they allow us to control the value of the memory 
length n and thus allows us to use TM. 

The probability of error varies from to 1 as a grows from 0. The rate of increase is dependent 
on the couple (/3, a). The average number of errors varies from to a value that depends on the 
distribution (3. 

There is one expected and obvious trend we can take out of the figures 5 to 12 and which is 
that the TM algorithm always, and usually very notably so, outperforms all the other algorithms. 
Indeed the TM algorithm is an exact implementation of bit MAP decoding. By definition, it 
minimizes the probability of error over variables. 



A. Noise as parameter 
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FIG. 5: Probability of error (Perr) for various algorithms vs a. The full line is the analytical lower bound. 
The figure on the bottom is a blowup of the one on the top for a small. With t = 100, n — 3, c — 15, using 
/?( and a/, average over 1000 samples and Nf = 500 for TM.IA.MC. 



Also, in general, we can see that the worst performing algorithm is, unsurprisingly, TM.IA.G 
though it is at par with the other algorithms for a small. 

More specifically, in Fig. 5, we see for larger a that TM.IA and TM.2A.G perform similarly 
and better than TM.IA. f and TM.IA.MC which also perform the same. All perform very well for 
a < 0.11. In Fig. 6 we see the same threshold of cr = 0.11 below which there is virtually no errors. 
Above this value we see that TM.IA and TM.2A.G remain very close, though the former slightly 
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FIG. 6: Average number of errors {ij^errors) for various algorithms vs a. The fuh hne is the analytical 
lower bound. The figure on the bottom is a blowup of the one on the top for a small. With t = 100, n = 3, 
c = 15, using fit and a/, average over 1000 samples and Nf = 500 for TM.IA. MC. 



outperforms the latter for a > 0.5. Both TM.lA.f and TM.IA.MC perform very similarly and are 
outperformed by TM.IA.G for a > 0.3 though this could be linked to the way algorithms respond 
to higher values of a. For very large values of a, the limit value of half the total length (i.e. 50 = | 
in the current example) is explained by the fact that when estimation is impossible the algorithms 
always return the estimate 1 for all positions which is the maximum of the prior distributions 
In the present case the probability that = 1 is very close to 1/2 thus the probability 1/2 of 
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FIG. 7: Same as 5 with t = 100, n = 3, c = 15, using and a/, average over 1000 samples and Nf = 500 

for TM.IA.MC. 



ending up with the correct value. 

In the case of {(3u,af), i.e. Figs. 7 and 8, we see that TM.2A.G performs very well compared 
to all the other algorithms which all perform very similarly, except for TM. The limit value in Fig. 
8 is the worst case scenario of randomly falling on the correct value. It is the total length times 
the complementary probability of randomly picking a value and therefore is = 93.33 with 

the current parameters. 

By setting the a function to in Figs. 9 to 12, we observe a similar behavior independently 
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FIG. 8: Same as 6 with t = 100, n = 3, c = 15, using (3u and a/, average over 1000 samples and Nf = 500 

for TM.IA.MC. 



of the distribution (3. Besides TM and TM.IA.G which behave according to the trends described 
previously, all algorithms perform very similarly. The only major difference is in the limit value 
for the average number of errors which is a function of (3. 

Finally, in the case of described in Eq. (16) and the distribution jSg described in (13) 
with q = 1/2, we have very similar performances for the three algorithms TM.IA.MC, TM.IA.G 
and TM.2A.G. The algorithms TM, TM.IA and TM.lA.f cannot be used with these parameters. 
Indeed, we have a relatively large memory n = 13, which is kept the same for all values of p, and 
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FIG. 9: Same as 5 with t = 100, n = 3, c = 15, using /3f and ah, average over 1000 samples and Nf = 500 

for TM.IA.MC. 

therefore TM does not fit in our computer memories and TM.IA and TM.lA.f would take several 
thousand years to compute a single sample on the computers used. These results are shown in 
Fig. 13. In this figure we show the interpolated contour lines of the probability of correct decoding 
Pcd = 1 — Perr as a function of both a and the incorporation rate p. We show in more detail in 
Fig. 14 how performance increases with p and how similar performance is for the three algorithms. 
It also shows, in this case for smaller values of p, that TM.IA.MC has a slight advantage over the 
two others. This advantage for TM.IA.MC with ar will be confirmed in VIC. 
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FIG. 10: Same as 6 with t = 100, n = 3, c = 15, using f3t and ah, average over 1000 samples and Nf = 500 

for TM.IA.MC. 

Finally, in Fig. 15 we show the average computation time^ of a single sample according to the 
algorithm versus the the probability of error at a certain a. The differences are huge, thus the log 
scale, and overall, TM.2A.G seems to give the best performance-time trade-off for t = 100 and 
n = 3. 



^ Computations were performed on one of the following CPUs: Intel Core 2 Duo E6700 at 2.66GHz, Intel Core 2 
Quad Q9550 at 2.83GHz and Intel Core 2 Duo E8500 at 3.16GHz. In spite of the variations in CPU speed, the 
magnitude of the computation times are always the same. 
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B. Memory length as parameter 

In this subsection, we consider the memory length n as being the control parameter. We will 
be setting a large so as to never have a completely decoded chain and thus we will be comparing 
the average number of errors only. 
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FIG. 12: Same as 6 with t = 100, n = 3, c = 15, using /3„ and at, average over 1000 samples and Nf = 500 

for TM.IA.MC. 



We first consider the limit case where the cutoff parameter is c = 2. This will enable us 
to compare the results between all algorithms, even the ones exponential in n, though only for 
relatively small values of the parameter. The average number of errors for aj are shown in Fig. 
16. There are very little differences between the algorithms when we consider a^, thus we do not 
show these figures. 

As in the previous subsection, TM outperforms all other algorithms quite well. Of the other 
algorithms, TM.IA returns the smallest number of errors though it seems to be caught up by 
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FIG. 13: Contour lines of the probability of correct decoding (Pcd) for various algorithms and versus the 
noise a and the incorporation rate p. Parameters are t — 100, c — 15, n = 13, using f3g and average over 

1000 samples and Nf = 500 for TM.IA.MC. 
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FIG. 14: Contour lines of Pcd = 0.8, 0.6, 0.4 and 0.2 from left to right for TM.IA.MC, TM.lA.G and 
TM.2A.G versus the noise <t and the incorporation rate p. Parameters are t = 100, c = 15, n = 13, using 
(3g and average over 1000 samples and Nf 500 for TM.IA.MC. 

TM.2A.G for larger values of n. For both distributions and A, TM.lA.f and TM.IA.MC are 
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FIG. 15: Average time per sample on a log scale and in seconds vs Perr for the various algorithms. With 
a = 0.1 when using an and a = 0.18 when using a/, t = 100, n = 3, c — 15, average over 1000 samples and 

Nf = 500 for TM.IA.MC. 
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FIG. 16: Average number of errors {ij^errors) for various algorithms vs memory length n. With t = 100, 
s = 1, c = 2, average over 1000 samples and Nf = 500 for TM.IA.MC. 

very similar though there is a slight advantage to TM.lA.f. Finally, TM.IA.G performs very much 
like TM.IA.MC in the case where the distribution (3t is used. On the other hand, it performs quite 
poorly, compared to TM.IA.MC, when is considered. 
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FIG. 17: Average time per sample on a log scale and in seconds vs memory length n. With t — 100, s — 1, 
c = 2, using f3u and a/, average over 1000 samples and Nf = 500 for TM.IA.MC. 

In Fig. 17 we show the various times per sample for the different algorithms for /3„ and aj. 
The times are exactly the same for the other possible combinations of /5 and a. This figure shows 
why we limit ourselves to n = 17 as an upper bound for TM, TM.IA and TM.lA.f. 

We now consider the case where c = 15 in Fig. 18. We show all four combinations of f3u and Pt 
with Q/ and at for the algorithms TM.IA.MC, TM.IA.G and TM.2A.G. 

In all cases, TM.2A.G outperforms the two others and performs better as n increases. In 
both cases where we consider the function it performs much better than the two others. With 
(Xf), TM.IA.MC and TM.IA.G perform about the same, in all other combinations TM.IA.MC 
performs better. In general, these two algorithms reach a plateau value relatively quickly. 

In Fig. 19 we have again shown the computation times for the various algorithms. Even 
though TM.2A.G systematically outperforms the two others in Fig. 18, its computation time 
grows sub-exponentially with n, though it does become large. 

C. Total length as parameter 

In this subsection we study the influence of the total length t. 

The first figure in this subsection, Fig. 20, shows the probability of correctly decoding each 
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FIG. 18: Average number of errors (^errors) for various algorithms vs memory length n. The full line is 
the analytical lower bound (almost always equal to zero in the upper two figures). With t — 100, s — 0.2, 
c — 15, average over 1000 samples and Nf = 500 for TM.IA.MC for all figures. 



position a {P^d) in a chain of total length t. We can see that for a bigger than a certain threshold 
close to a = 150, P^d rapidly decreases to a value close to 0.5. This can be explained by looking 
at the graph of Oj. in Fig. 4 where we can see that, for p = 0.99 and a = 140, the maximum of 
ar.(i, a) is no longer for i = a. 

In Fig. 21 we show the probability of error I^^yf foi various algorithms with (3t and p = 0.99 
as a function of the total length t. In Fig. 22 we show the same thing with (3g and p = 0.999. 
They confirm what was shown in Fig. 20. In this particular regime, i.e. with a^, TM.IA.MC 
slightly outperforms TM.IA.G and TM.2A.G. Furthermore, the two latter algorithms seem to 
behave exactly in the same way to variations of t when is used. 

Finally, in Fig. 22 we see never reaches zero. This behavior is detailed in Fig. 23 

where we see that for very small t the number of errors is non vanishing when j3g is used while it 
is when (dt is used. The reason for this is that when the chain is generated using (3g then Aa > c 
with positive probability: the decoder fails in these cases. This is the only notable difference when 
using Pg instead of (3t ■ 
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FIG. 19: Average time per sample on a log scale and in seconds vs memory length n. With t = 100, 
s = 0.2, c — 15, using Pu and a/, average over 1000 samples and Nf = 500 for TM.IA.MC. 
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FIG. 20: Probability of correctly decoding each position a {Pvd) in a chain of total length t ~ 300 for 
various algorithms. With p = 0.99, n = 13, c = 15, ct = 0.04, using (3t and a^, average over 1000 samples 

and Nf = 500 for TM.IA.MC. 



In Fig. 24 we plot the probability of error as a function of t for various values of the 
incorporation rate p for (3t and a,.. As p increases, the performance does increase as well. 
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FIG. 21: Probability of error (Perr) for various algorithms vs full chain length t. The full line is the 
analytical lower bound. With p = 0.99, n — 12>, c — \b , a ~ 0.04, using j3t and a^, average over 1000 

samples and Nf = 500 for TM.IA.MC. 




FIG. 22: Same as 21 with p = 0.999, n = 13, c = 15, ct = 0.04, using [3g and a^, average over 1000 samples 

and Nf = 500 for TM.IA.MC. 
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FIG. 23: Average number of errors {^errors) vs t for TM.IA.MC. With p = 0.99, n = 13, c = 15, 
(7 = 0.04, using ar, average over 1000 samples and Nf = 500. 
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FIG. 24: Probability of error (Perr) for TM.IA.MC vs full chain length t for various values of the 
incorporation rate p. With n= 13, c = 15, (t = 0.04, using (it and ar, average over 1000 samples and 

Nf = 500. 
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D. Comparison to real data 

In this final section, we use parameter values inspired by real data. These values (used in [10]) 
are obtained from [17] which presents pyrosequencing data generated from two sets of tests and 
where a maximum likelihood sequence detection (MLSD) algorithm is developed. These sets are 
obtained from two separate sequencing systems, the first set of tests is obtained on a Pyrosequencing 
PSQ96MA system for a set of 5 templates ranging from 55 to 224 base sub-sequences. The second 
test run consisted of 10, 000 data sets extracted using a 454 Genome Sequencer 20 (GS20) system. 
These two machines are examples of two separate generations of sequencing systems having different 
precisions. The mathematical model which was developed to apply the MLSD algorithm was one 
of the inspirations for this work to which we applied an additional set of approximations and which 
we generalized. 

They estimate that the incorporation rate is p = 0.9955 for the PSQ96MA system and p = 
0.9987 for the GS20, while the read error introduces a Gaussian noise with a standard deviation of 
a = 0.08 in both cases. In our approach, we have considered a modified base sequence composed 
of a two letter alphabet in place of the 4 letters that compose DNA strands. The output being 
read as the number of repetitions of a single base in both cases (HP sub-sequences), the main 
difference in both approaches is seen in the modification of the memory function a (16) which has 
a simplified expression from that of the similar relation defined in [17]. Though we estimate that 
this consideration does not modify significantly the read length and therefore that the behavior 
of our 2-base model will be similar to that of the real 4-base sequences. One thing that might 
lack is the non-specific incorporation rate that involves false positives. This difference can result 
in errorless read lengths that are higher than in experimental data, though its estimated value 
smaller than 0.02 will result in only a slight increase. These issues will be further addressed in a 
future communication, here we will limit ourselves to presenting a few results on the behavior of 
our model using the parameters introduced at the beginning of this paragraph. 

In figure 25 we have plotted the probability of correctly decoding each position of a chain of 
length t = 300 for p = 0.9955 for three algorithms. This would emulate extracting sequences using 
the PSQ96MA system. As already mentioned in previous sections, TM.IA.MC performs better 
than the Gauss algorithms with these parameters for higher values of the position a within the 
chain. In figure 26 we plot the same probability for a chain of total length t = 900 with p = 0.9987 
and which, in this case, would emulate extraction on a GS20 system. In [17], the MLSD algorithm 
was capable of correctly reading 170 base sub-sequences out of 208 and 205 out of 228 of the longest 
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FIG. 25: Probability of correctly decoding each position a [Pyd) in a chain of total length t = 300 for 
various algorithms. With p = 0.9955, {n = 11), c = 15, ct = 0.08, using /3g and a^, average over 1000 

samples and Nf = 500 for TM.IA.MC. 




FIG. 26: Same as 25 with p = 0.9987, (n = 9). 



two templates ran on the PSQ96MA and all bases for templates of length 168 on the GS20. By 
comparing these results to our own, we can say that they are at least consistent. Indeed, in figure 
25 we have P^d > 0.9 for values of a < 230. We can say this since the presented algorithms do 
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not depend directly on the total length t but only on the position a. Furthermore, for a total 
length of t = 300 the average number of errors encountered during the decoding process of the 
TM.lA.MC algorithm is 7.815 it 0.153 over 1000 samples. Similar results are also true concerning 
data presented in figure 26 although all we can say in comparison to [17] is that we do indeed 
decode correctly chains of total length 168. Also, they predict that correct decoding is possible for 
sequences of length greater than 500 which we easily obtain (we have an average number of errors 
of 9.044 ± 0.164 in decoding chains of length t = 900 with TM.lA.MC). 

We conclude that our data is in good concordance with results obtained experimentally in [17], 
though we emphasize again that direct testing of our algorithms still needs to be performed. 

VII. CONCLUSION 

We defined three low complexity algorithms based on the original high complexity transfer 
matrix algorithm (TM) described in IV A. Of these three, two are first order approximations: the 
Monte Carlo algorithm (TM.lA.MC) and the Gauss algorithm (TM.IA.G) described respectively 
in IV Bl and IV B 2. The final algorithm is a second order approximation that is based on the 
Gauss algorithm (TM.2A.G) and is described in IV C. The performances of these algorithms were 
studied in the previous section VI where we show that the second order approximation with Gauss 
(TM.2A.G) is the best performing algorithm when the memory n is small and the first order 
approximation with Monte Carlo (TM.lA.MC) performs best when the memory n is large. 

Besides its direct application to hidden Markov Models of higher order, the description 
introduced in this paper has yet to be tested on real pyrosequencing data. The context being 
different and the variations that need to be brought to our approach being quite substantial, this 
issue will be addressed directly in a future communication though preliminary comparison to 
similar work is promising. 

APPENDIX A: GAUSS ALGORITHM 

As stated in section IV B 2, we assume that the variable X^""^ = Ylj=a-n ot{j: present in 
at step a (Eq. 8) can be approximated with a Gaussian random variable. 
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There are two steps for each iteration to derive the marginal distribution of X = {xi, . . . ,xt} 
using this method. 

We suppose the iterative process of the transfer matrix has brought us to position a. Then, the 
first step is to calculate the mean and variance of the variable Xa = YliZl-n '^(^' under the 
Gaussian approximation. These can be written, respectively, as 

a-l 

^x=nX)= ^ a(z,a) E("-i)(xi), (Al) 



i=a—n 
a-l 



a^ = Var(X)= ^ a{i,af \aT^''-^\xi) , 



(A2) 



where WS-'^~'^\xi) and Var^'*^"^^ (xj) are the expectation and variance calculated at position i using 
the distributions v^"" '^\xi) which where obtained at the previous iteration. 

We can then define the Gaussian variable x such that its probability density function is 



Fix) 



27r(T| 



: exp 



24 



(x - iixf 



(A3) 



[Va-x- a{a,a)xa] 



and finally, the probability of having value Xa at step a is 

^\^\xa) = ^ |dxP(x)exp[-^_ 
which yields 

^ r 1 

■[y; - a{a,a)xa - ^lx?' 



(a)( \ Pi^a) 

where is a normalization constant. 
We then keep 



2(a2 + 4) 



(A4) 



(A5) 



E(«)(x,) = Y,xv^\x), 

x=\ 
c 

Var(")(xa) = ^x2i.W(x) -E('^)(xa) 



(A6) 
(A7) 



for the next step and for subsequent iterations. 



The second step is then to notice that for all i G {a — n, . . . , a — 1} we can rewrite Eq. (24) as 
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^/3(xa)exp 
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where is a normalization factor. 

Then we define step and position dependent mean and variance of the variables X^^"^ 
Y!j=i a(i, a)xj as 



^^x,=W.{Xi) = ^ a{j,a)Ef '\x,) , (A9) 



j=a-n 



ai^=Var(Xi) = a{j,af Y^if '\xj) , (AlO) 



j=a— n 

where we artificially set eI" ^\xa) = ^a^\xa) and the same for the variance. From these and a 



similar expression to Eq. (A4) we recover Eq. (29). 

For subsequent iterations, we keep k["'\x,i) and Var-"^(xj). 



APPENDIX B: TWO POINT ALGORITHM 

As stated in section IV C we consider two point interactions only over closest neighbors. Since we 
can reconstruct one-point marginals as marginals of the two-point marginals, we need only compute 
the latter. We use a similar method to the one used in section IV B 2 through approximating our 
sum variable X with a Gaussian random variable with mean and variance which are expressed in 
the following two step procedure. 

We start by calculating ^xxa-i,a ^^id respectively mean and variance of Xa-i^a = 

a-2 

= nXa-i,a) = J2a{k,a) E("-i)[xfc|xa_i] , (Bl) 
k=l 
a-2 

= Var(X„_i,J = ^a(A:,a)2 Var("-^)(xfc|x,„i) + 
k=l 

) , (B2) 



l<fc<fe'<a-l 



where 



Yar^--'\xk\xa-i) = E^^-'^[xl\xa-i] - E^'^-'^ix.lxa-if , (B3) 
Cov^^-^HxkXk'lxa-i) = E^^-^^ixkXk'K-i] -E^^-^^ixklxa-ijE^^-^^ixk'lxa-i] , (B4) 
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and which yield 



V"; \ 



(B5) 



Once these values obtained, we define v^\xa) = Yl'xa-i=i ^a-i a(^«-ii ^a) and 

c 

m„ = ^ X, i^l^\xa), (B6) 

(B7) 



3:a = l 

C 



which are injected into the second step of the procedure which is to compute v\°'^j^^{xi,Xi+i) for 
i < a — \. Again, we approximate the sum variable Xj^j+i = j=i a{j,a)xj with a Gaussian 
random variable with mean and variance respectively 



a-l 



= ma + = ma+ ^ a(/c,a)E('' ^\xk\xi,Xi+i] 



(B8) 



fc=i 

a-l 



= "f^a + Var(Xi_i+i) = 7;^+ ^ a)^ Var^" ^\xk\xi,Xi+i) + 



k=l 



+2 a)a(A;', a) Cov*-" "^^(xjfcj;fc'|xj, Xj+i) , (B9) 



l<k<k'<a 
k,k'^i,i+l 



which enable us to compute 

( ■y ■ — ii \Ya—Ux ■ , -, —ci(i,a)xi—a(i+l ,a)xi+i]^ 



(1) / \ 



(BIO) 



which is the expression in (32). 



In writing these equations (Bl - BIO) we used several expressions that we need to give more 
detail to. First of all, in the situation where a distribution can be decomposed on one dimensional 
factor graph, we can express the conditional probabilities as 

H{xj\xi) = ^ n{xi+i\xi) . . . fj,{xj\xj^i) , (Bll) 

where ^ expresses probabilities in general. 

We can then use the previous expression to compute the necessary expectations for Eqs. (B5) 
and (BIO), in all following cases i < j and k < k' when necessary. 



(B12) 



Xk = l 
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c 

= ^ XkXk' u'^''\xi\Xk')u'^''\xk'\Xk) 



c 



/")(x,|xfc)i/W(arfc|x,-) 

(Xi\Xj) 



\fk<i<k' , (B13) 

if i < A; , (B14) 

otherwise , (B15) 

if A; < i , (B16) 

if A; > j, (B17) 

otherwise , (^18) 



= W.^''\xkXk'\xj\ 
= W.^''\xk\xi]W^''\xk>\xi,Xj] 

= ¥!^''\xk\x,,Xj\¥.^''\xk'\xj] 

= W.^"'\xk\xi,Xj\W.^''\xk>\xi,Xj] 

v^''\xi\xky''\xMv^''\xk'\xj) 



= ^ XkXk' 



if A;' < i , (B19) 
if A; > j , (B20) 
if A; < z < A;' < j , (B21) 
if i < A; < j < A:' , (B22) 
if A; < i < j < A;' , (B23) 
(B24) 



Hi <k <k' < j . 



Finally, in most cases, the values of Xj+i) obtained numerically, though they do contain 

the information we are looking for, are not a distribution. That is, more often than not do we 
get ^xi+i '^i'f+iixi, Xi+i) 7^ ^xi-i ^i-ii{xi-i,Xi). To overcome this we introduce a new set of real 
distributions Xj+i) such that the Kullback-Leibler divergences, regarded as distances, 

between the lJ-ff_^_i and the I'^'f^i are minimized. That is, we wish to minimize the quantity 



i 

given the constraints 



That is, by using Lagrange multipliers, we wish to minimize 



(B25) 



(B26) 



(B27) 
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which, after differentiation, results in 



log f^^^^ltl^] = A.(x.) - A.+i(x.+0 , (B28) 
for all i and all couples We define ji{xi) = e'^^^^^\ hence we have 



(a) / \ ^i{xi) (a) , s /Dnn\ 



Furthermore 



^ iif\^Xi-x,Xi) = — ^ ^ z^}"\ .(xi_i,x,)7i_i(x,_i) , (B30) 

where the left hand side of Eqs. (B30) and (B31) are equal by the constraint (B26) and thus by 
equating the right hand sides we have 

=- -• (B32) 

'^i,i+i(a;i,a;i+i)/7i+i(xi+i) 

We initiate the procedure by setting all values of ^i{xi) in the right hand side to the value 1. The 
values obtained on the left hand side are then reinjected into the right hand side iteratively until 
the values of the 7j(xj) of all i and Xi are equal on both sides of the equation. To prevent the 
appearance of static cycles, we produce the update at each step by randomly choosing the order 
in which we take the functions ji for all i. 

Finally, the are injected into the next step a + 1, taking the place of the 



ACKNOWLEDGMENTS 

I wish to thank Andrea Montanari and Guilhem Semerjian for their unbounded help and support 
with this work. 



[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," 
Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989. 

[2] K. Karplus, C. Barrett, and R. Hughey, "Hidden Markov models for detecting remote protein homolo- 
gies," Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998. 



43 



O. Cappe, E. Moulines, and T. Ryden, Inference in Hidden Markov Models. Springer, 2005. 

O. Zuk, I. Kanter, and E. Domany, "The Entropy of a Binary Hidden Markov Process," Journal of 

Statistical Physics, vol. 121, no. 3, pp. 343-360, 2005. 

L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing 
Symbol Error Rate," IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284-287, 1974. 
L.-M. Lee and J.-C. Lee, "A Study on High-Order Hidden Markov Models and Applications to Speech 
Recognition," Advances in Applied Artificial Intelligence, pp. 682-690, 2006. 

Y. Wang, L. Zhou, J. Feng, J. Wang, and Z.-Q. Liu, "Mining Complex Time-Series Data by Learning 
Markovian Models," Data Mining, 2006. ICDM '06. Sixth International Conference on, pp. 1136-1140, 
Dec. 2006. 

D. J. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 
1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005. 

M. Ronaghi, M. Uhlen, and P. Nyren, "DNA SEQUENCING: A Sequencing Method Based on Real- 
Time Pyrophosphate," Science, vol. 281, no. 5375, pp. 363-365, 1998. 

H. Eltoukhy and A. El Gamal, "Modeling and base-calling for DNA Sequencing-By-Synthesis," Acous- 
tics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Confer- 
ence on, vol. 2, pp. H-H, 2006. 

E. Elahi and M. Ronaghi, "Pyrosequencing: A Tool for DNA Sequencing Analysis," Methods in Molec- 
ular Biology, vol. 255, pp. 211-219, 2004. 

M. Ronaghi, "Pyrosequencing Sheds Light on DNA Sequencing," Genome Research, vol. 11, no. 1, 
pp. 3-11, 2001. 

M. Ronaghi, "Improved Performance of Pyrosequencing Using Single-Stranded DNA-Binding Protein," 
Analytical Biochemistry, vol. 286, no. 2, pp. 282-288, 2000. 

B. Ewing and P. Green, "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Proba- 
bilities.," Genome Research, vol. 8, no. 3, pp. 186-194, 1998. 

P. Nyren, "History of Pyrosequencing," Methods in Molecular Biology, vol. 373, pp. 1-14, 2007. 
A. Svantesson, P. O. Westermark, J. H. Kotaleski, B. Gharizadeh, A. Lansner, and P. Nyren, "A 
mathematical model of the Pyrosequencing reaction system," Biophysical Chemistry, vol. 110, no. 1-2, 
pp. 129 - 145, 2004. 

H. Eltoukhy, An integrated system for de novo DNA sequencing. PhD thesis, Stanford University, 2006. 
M. Mezard and A. Montanari, Information, Physics and Computation. Oxford University Press, 2009. 



